Skip to content

[CoreML EP] Support pre-opset-13 Split via 'split' attribute#28270

Merged
yuslepukhin merged 7 commits into
microsoft:mainfrom
maxwbuckley:coreml-split-attr
May 6, 2026
Merged

[CoreML EP] Support pre-opset-13 Split via 'split' attribute#28270
yuslepukhin merged 7 commits into
microsoft:mainfrom
maxwbuckley:coreml-split-attr

Conversation

@maxwbuckley
Copy link
Copy Markdown
Contributor

@maxwbuckley maxwbuckley commented Apr 29, 2026

Description

The CoreML SplitOpBuilder previously gated GetMinSupportedOpSet at 13 because pre-13 Split carries split sizes via an INTS attribute rather than a second input. This PR lowers the gate to 1 and reads the attribute in both the MLProgram and NeuralNetwork emitters, so Split from any opset is supported on the CoreML EP.

The validation in IsOpSupportedImpl mirrors the existing input-form rules — ≥2 outputs, sum of sizes equals the axis dim, all sizes positive, axis dim not dynamic. For the no-attribute / no-input case (legacy even-split) we also explicitly require the axis dim to be evenly divisible by num_outputs, since CoreML's num_splits requires that. This is a behavior change only for opset 2–12 graphs that were 100% rejected before, so no path that used to work regresses.

Motivation

DWPose dw-ll_ucoco_384.onnx (opset 11), a common pose-estimation model, has two Split nodes — one uneven (split=[512, 512, 128]) and one even (split=[1, 1]). Both fall back to CPU today, fragmenting the CoreML partition.

Without this PR With this PR
CoreML partitions 3 1
Nodes on CoreML EP 301 / 303 303 / 303

Benchmark — M3 Max, MLProgram, batch 1, 1299-iter steady state

Metric Without PR With PR Δ
Mean 6.838 ms 6.565 ms −4.0%
StdDev 0.239 ms 0.170 ms −29%
P50 6.810 ms 6.545 ms −3.9%
P95 7.070 ms 6.775 ms −4.2%
P99 7.330 ms 6.928 ms −5.5%
P99.9 8.917 ms 8.164 ms −8.4%
Max 12.616 ms 10.360 ms −17.9%

Removing the two CPU↔CoreML round trips improves the tail far more than the median — useful for real-time perception pipelines where worst-case latency determines the frame budget.

Tests

Eight new tests in onnxruntime/test/providers/coreml/coreml_basic_test.cc, each exercising both the NeuralNetwork and MLProgram emitters and asserting full CoreML EP node assignment (no CPU fallback).

Pre-opset-13 attribute form (the new code path):

  • Split7UnevenAttribute — opset 7 uneven split=[4, 3, 2], covering the opset 7–10 range.
  • Split11UnevenAttribute — DWPose's pattern, split=[4, 3, 2].
  • Split11EvenAttribute — uniform sizes via attribute.
  • Split11NoAttributeEven — falls through to the num_splits = num_outputs branch.

Post-opset-13 input form (parity with the existing, untouched path):

  • Split13UnevenInputsplit input [4, 3, 2].
  • Split13EvenInput — uniform sizes via input.
  • Split13NoSplitInputEven — no split input, even-split fallback.

Negative coverage:

  • Split11ZeroSplitValueNotSupported — verifies the attribute-form rejection of a non-positive entry; expects no CoreML assignment.

All eight pass locally on macOS 26.3 / M3 Max.

Motivation for upstreaming

Most pre-2023 vision exports (DWPose, MMPose models, original YOLOv5/v7/v8, etc.) target ONNX opset 11/12 and use the Split attribute form. They currently lose any Split to CPU on the CoreML EP. This is a self-contained gap with a clean fix.

The CoreML SplitOpBuilder gated `GetMinSupportedOpSet` at 13 because
pre-13 Split carries split sizes via an INTS attribute rather than a
second input. Lowers the gate to 1 and reads the attribute in both
the MLProgram and NeuralNetwork emitters. Adds validation matching
the existing input-form path (>=2 outputs, sum equals axis size, no
zero entries, axis size not dynamic). For the no-attribute / no-input
case (legacy even split) we now also explicitly require the axis size
to be evenly divisible by num_outputs, since CoreML's `num_splits`
demands it; this only affects opset 2-12 graphs that were 100%
rejected before, so no existing path regresses.

Motivated by DWPose `dw-ll_ucoco_384.onnx` (opset 11), which has two
Split nodes — one uneven (`split=[512, 512, 128]`), one even
(`split=[1, 1]`) — both falling back to CPU and fragmenting the
CoreML partition. Without this change: 3 partitions, 301/303 nodes
on CoreML EP. With it: 1 partition, 303/303 nodes on CoreML EP. On
M3 Max (1299-iter timed run, batch 1, MLProgram):

  P50    6.81 ms  -> 6.55 ms
  P99    7.33 ms  -> 6.93 ms
  Max   12.62 ms  -> 10.36 ms (-17.9%)
  StdDev 0.24 ms  ->  0.17 ms (-29%)
  CPU EP baseline P50 38.7 ms

Adds three tests covering both NeuralNetwork and MLProgram emitters:

  Split11UnevenAttribute    — uneven sizes via attribute
  Split11EvenAttribute      — uniform sizes via attribute path
  Split11NoAttributeEven    — fall-through to num_splits branch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Contributor Author

Refactored the tests to match the HardSigmoid PR (#28182) pattern:

  • Use onnxruntime::Model / Graph API instead of raw ModelProto construction.
  • Pin opset 11 explicitly via domain_to_version so the attribute form is exercised.
  • Run both NeuralNetwork and MLProgram emitters in each test (the NN code path I changed wasn't covered before).
  • Inline per-test, no shared helper, matching the style of the surrounding HardSigmoid / Pad tests.
  • Replaced the synthetic 1152-element DWPose-shape input with smaller (1,9 / 1,6 / 1,8) tensors that still cover the same code paths.

All three tests pass locally (NN + MLProgram for each scenario).

maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request Apr 29, 2026
The CoreML GatherOpBuilder rejected rank-0 (scalar) indices because
CoreML's `gather` requires rank-1+ indices and the workaround would
change the output rank. Fixes microsoft#28180 by performing the workaround
internally: when indices are scalar, the builder now emits

  expand_dims(indices, axes=[0])  ->  indices_1d : shape [1]
  gather(data, indices_1d, axis)  ->  shape data_shape with axis=1
  squeeze(., axes=[axis])         ->  ONNX gather output shape

in both the MLProgram and NeuralNetwork emitters. This requires
claiming a static intermediate shape, so we restrict the fast path
to fully-static `data` shape; dynamic-shape scalar Gather still
falls back to CPU (existing behavior).

Motivated by StyleGAN-family generators (e.g. GFPGAN), which select
per-layer style codes with a scalar-index Gather. Combined with
microsoft#28270 (pre-opset-13 Split), GFPGAN-1024 (input 1x3x512x512) goes
from 9 partitions / 534 CoreML nodes to 1 partition / 557 CoreML
nodes on M3 Max. 100-iter timed run, MLProgram:

                          partitions   mean      P99    max
  origin/main             9            118.0 ms  131.0  131.3
  Gather fix only         8            118.1 ms  134.1  134.7
  Gather + Split (microsoft#28270) 1             81.8 ms   96.7   98.1

The Gather fix alone trims one graph break (the 16 scalar Gathers
collapse into the surrounding CoreML partitions); the full latency
win lands when both fixes ship.

Adds three tests covering both NeuralNetwork and MLProgram emitters:
  GatherScalarIndicesAxis1     - axis=1 (squeeze inserted axis)
  GatherScalarIndicesAxis0     - axis=0 (squeeze leading axis)
  GatherScalarIndicesNegativeAxis  - axis=-1 (HandleNegativeAxis path)

Updates the comment on the existing GatherWithScalarIndices test to
reflect that scalar Gather is now supported when 'data' is fully
static (the test continues to exercise the dynamic-shape fall-back).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request Apr 29, 2026
The CoreML GatherOpBuilder rejected rank-0 (scalar) indices because
CoreML's `gather` requires rank-1+ indices and the workaround would
change the output rank. Fixes microsoft#28180 by performing the workaround
internally: when indices are scalar, the builder now emits

  reshape(indices, shape=[1])     ->  indices_1d
  gather(data, indices_1d, axis)  ->  shape data_shape with axis=1
  squeeze(., axes=[axis])         ->  ONNX gather output shape

in both the MLProgram and NeuralNetwork emitters. We use `reshape`
rather than `expand_dims` because CoreML internally pads scalars and
expand_dims on the padded tensor can push the apparent rank past
CoreML's rank-5 limit.

Restrictions:
  - 'data' must have a fully static shape (we claim a static
    intermediate shape between gather and squeeze).
  - 'data' rank capped at 4 (the rank-5 case still trips CoreML's
    compiler with "Invalid rank: 6", so we keep the conservative
    boundary).

Dynamic-shape and rank-5+ scalar Gather still falls back to CPU.

Motivated by StyleGAN-family generators (e.g. GFPGAN), which select
per-layer style codes with a scalar-index Gather. Combined with
microsoft#28270 (pre-opset-13 Split), GFPGAN-1024 (input 1x3x512x512) goes
from 9 partitions / 534 CoreML nodes to 1 partition / 557 CoreML
nodes on M3 Max.

Adds six tests covering both NeuralNetwork and MLProgram emitters:
  GatherScalarIndicesAxis1         - axis=1, mid-rank squeeze
  GatherScalarIndicesAxis0         - axis=0, leading-axis squeeze
  GatherScalarIndicesNegativeAxis  - axis=-1, HandleNegativeAxis path
  GatherScalarIndicesFloat16       - fp16 data (MLProgram only)
  GatherScalarIndicesInt64Data     - int64 data, both formats
  GatherScalarIndicesRank4Data     - rank-4 data, exercises the cap

Updates the comment on the existing GatherWithScalarIndices test to
reflect that scalar Gather is now supported when 'data' is fully
static and rank-4 or below; the test continues to exercise the
dynamic-shape fall-back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three tests that mirror the opset-11 attribute-form Split tests using
the opset-13 input form (split sizes as a constant initializer):

  - Split13UnevenInput        (parity with Split11UnevenAttribute)
  - Split13EvenInput          (parity with Split11EvenAttribute)
  - Split13NoSplitInputEven   (parity with Split11NoAttributeEven)

Each test runs both the NeuralNetwork and MLProgram emitters, compares
outputs against the CPU EP, and asserts ExpectedEPNodeAssignment::All so
any CoreML fallback would surface as a failure.

Also drops the redundant "Pre-opset-13 split attribute is also supported"
sentence from coreml_supported_mlprogram_ops.md — the table already lists
the operator, and the attribute-form support is documented in the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Contributor Author

Added a follow-up commit (b1989fc) addressing two things:

1. Opset-13 input-form parity tests

The original three tests only exercised the new pre-opset-13 attribute path. To prove the post-opset-13 input-form path (the existing, untouched code branch) still behaves identically, this commit adds three mirror tests with the same shapes and split sizes:

New test Mirrors Path exercised in split_op_builder.cc
Split13UnevenInput Split11UnevenAttribute input-form (input_defs.size() > 1) — MLProgram L68–74 / NN L111–119
Split13EvenInput Split11EvenAttribute same path, uniform sizes
Split13NoSplitInputEven Split11NoAttributeEven no split input, SinceVersion() < 18 even-split fallback — MLProgram L79–82 / NN L125–127

Each test runs both the NeuralNetwork and MLProgram emitters via RunAndVerifyOutputsWithEP, which:

  • Compares outputs element-wise against the CPU EP baseline (1e-5 fp32 tolerance), and
  • Asserts ExpectedEPNodeAssignment::All, so any CoreML fallback would fail the test rather than silently regress.

This locks in the parity argument from the PR description ("no path that used to work regresses") with explicit coverage rather than relying on it being implicit.

All 6 tests (3 original + 3 new) pass locally on macOS 26.3 / M3 Max:

[ OK ] Split11UnevenAttribute     (64 ms)
[ OK ] Split11EvenAttribute       (33 ms)
[ OK ] Split11NoAttributeEven     (30 ms)
[ OK ] Split13UnevenInput         (43 ms)
[ OK ] Split13EvenInput           (46 ms)
[ OK ] Split13NoSplitInputEven    (42 ms)

2. Ops doc cleanup

Removed the redundant "Pre-opset-13 split attribute is also supported." sentence from coreml_supported_mlprogram_ops.md. The presence of ai.onnx:Split in the table is already the support claim, and tying it to a specific opset boundary in the doc is more noise than signal once both forms are supported.

The two-line comment explaining that pre-opset-13 Split uses an attribute
form was anchored to GetMinSupportedOpSet, but the relevant context for a
reader is in IsOpSupportedImpl, which already documents the attribute path
inline. Removing the duplicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Contributor Author

@yuslepukhin greatly appreciate your help and guidance on this and my other Apple Onnxruntime optimizations :)

@yuslepukhin yuslepukhin requested a review from Copilot May 5, 2026 18:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds CoreML EP support for legacy ONNX Split nodes that encode split sizes in the pre-opset-13 split attribute, and expands CoreML Split test coverage.

Changes:

  • Lowers the CoreML Split minimum supported opset from 13 to 1.
  • Adds pre-opset-13 split-attribute handling in both MLProgram and NeuralNetwork emitters, plus validation for legacy even-split behavior.
  • Adds new CoreML tests covering Split opset 11 and opset 13 for uneven, even, and implicit-even split forms.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc Enables legacy Split support in the CoreML builder and updates support validation.
onnxruntime/test/providers/coreml/coreml_basic_test.cc Adds Split coverage for opset 11/13 across attribute, input, and implicit-even split forms.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc
Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc
Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc
Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc Outdated
Comment thread onnxruntime/test/providers/coreml/coreml_basic_test.cc
maxwbuckley and others added 2 commits May 5, 2026 22:17
- Reject non-positive entries (not just zero) in the 'split' attribute
  validation; previously [5, -1] would pass since the sum still matched
  the axis size.
- Split the combined dynamic-axis / non-divisible branch in the omitted-
  split path into two distinct conditions so the verbose log no longer
  blames divisibility when the actual cause is a dynamic axis.
- Add opset-7 attribute test (Split7UnevenAttribute) for the <=10 range
  the builder now advertises.
- Add two negative tests verifying CoreML rejects unsupported configs:
  Split11ZeroSplitValueNotSupported (zero entry in 'split') and
  Split11OmittedNonDivisibleAxisNotSupported (omitted split with
  axis size not divisible by num outputs).
ONNX shape inference for Split-11 with omitted 'split' rejects models
whose declared output shapes don't match an even split, so the test
fails at graph.Resolve() before the CoreML EP gets a chance to apply
its own divisibility rejection. The CoreML check at lines 187-191 of
split_op_builder.cc remains as defensive code, but it isn't reachable
through a well-formed model. Remove the unreachable test.

The other negative case from the Copilot review
(Split11ZeroSplitValueNotSupported) covers the new non-positive entry
rejection and stays in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc Outdated
Comment thread onnxruntime/test/providers/coreml/coreml_basic_test.cc
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc
Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/coreml/builders/impl/split_op_builder.cc
maxwbuckley and others added 2 commits May 6, 2026 09:13
When the splitting axis is dynamic (split_dims_at_axis == -1), the sum
of split sizes — always positive after the explicit positivity check —
can never equal -1, so the mismatch check fired first with a misleading
"sum of split doesn't match axis size" message. The actual cause was
that the axis is dynamic, which CoreML doesn't allow.

Reorder both the input-form and attribute-form paths so the dynamic-axis
check runs before the sum check. The pre-existing latent bug in the
input-form path (untouched by this PR's behavior change) is fixed at the
same time, per reviewer request.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ONNX schema and the CPU Split kernel both accept a Split node with a
single output, so the CoreML rejection (split_attr->size() < 2 in the
attribute-form path) is observable via partition assertion and worth
locking in.

The other negative cases yuslepukhin requested — sum mismatch,
non-divisible omitted split, negative split value — are caught earlier
by ONNX shape inference and CPU's SplitBase Initialize check, so the
malformed model never reaches the partitioner with a CPU fallback path
to compare against. The CoreML checks for those cases remain as
defensive code but cannot be exercised in isolation through the public
graph-builder API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yuslepukhin yuslepukhin merged commit 470977a into microsoft:main May 6, 2026
85 of 87 checks passed
maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request May 7, 2026
Resolves conflict in onnxruntime/test/providers/coreml/coreml_basic_test.cc
where this branch's FusedConv test helpers + 6 tests landed in the same
file region as the Split11/13/7 tests merged via microsoft#28270. Both sets are
preserved sequentially.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request May 7, 2026
Resolves conflict in coreml_basic_test.cc where this branch's
tile/ceil/identity helpers + tests landed in the same file region as
the Split11/13/7 tests merged via microsoft#28270. Both sets are preserved
sequentially.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request May 7, 2026
Resolves conflict in coreml_basic_test.cc where this branch's
GatherScalarIndices* tests landed in the same file region as the
Split11/13/7 tests merged via microsoft#28270. Both sets are preserved
sequentially (Gather first, then Split).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants